Gemma/[Gemma_2]Deploy_with_vLLM.ipynb (522 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Tce3stUlHN0L"
},
"source": [
"##### Copyright 2024 Google LLC."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "tuOe1ymfHZPu"
},
"outputs": [],
"source": [
"# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dfsDR_omdNea"
},
"source": [
"# Gemma - deploy with vLLM\n",
"\n",
"This notebook demonstrates how you can deploy a Gemma model with [vLLM](https://github.com/vllm-project/vllm) and query it. vLLM is a fast and easy-to-use library for LLM inference and serving, and has built-in support for Gemma deployment.\n",
"\n",
"<table align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Deploy_with_vLLM.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MwMiP7jDdAL1"
},
"source": [
"## Setup\n",
"\n",
"### Select the Colab runtime\n",
"To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n",
"\n",
"1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n",
"2. Select **Change runtime type**.\n",
"3. Under **Hardware accelerator**, select **T4 GPU**.\n",
"\n",
"\n",
"### Gemma setup on Hugging Face\n",
"vLLM uses Hugging Face under the hood. So you will need to:\n",
"\n",
"* Get access to Gemma on [huggingface.co](huggingface.co) by accepting the Gemma license on the Hugging Face page of the specific model, i.e., [Gemma 2B](https://huggingface.co/google/gemma-2b).\n",
"* Generate a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) and configure it as a Colab secret 'HF_TOKEN'."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gJaQ-OVoPKCo"
},
"source": [
"## Installation\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DHrOMaOAPSAM"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: vllm in /usr/local/lib/python3.10/dist-packages (0.4.2)\n",
"Requirement already satisfied: cmake>=3.21 in /usr/local/lib/python3.10/dist-packages (from vllm) (3.27.9)\n",
"Requirement already satisfied: ninja in /usr/local/lib/python3.10/dist-packages (from vllm) (1.11.1.1)\n",
"Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from vllm) (5.9.5)\n",
"Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from vllm) (0.1.99)\n",
"Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from vllm) (1.25.2)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from vllm) (2.31.0)\n",
"Requirement already satisfied: py-cpuinfo in /usr/local/lib/python3.10/dist-packages (from vllm) (9.0.0)\n",
"Requirement already satisfied: transformers>=4.40.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (4.41.1)\n",
"Requirement already satisfied: tokenizers>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.19.1)\n",
"Requirement already satisfied: fastapi in /usr/local/lib/python3.10/dist-packages (from vllm) (0.111.0)\n",
"Requirement already satisfied: openai in /usr/local/lib/python3.10/dist-packages (from vllm) (1.30.2)\n",
"Requirement already satisfied: uvicorn[standard] in /usr/local/lib/python3.10/dist-packages (from vllm) (0.29.0)\n",
"Requirement already satisfied: pydantic>=2.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.7.1)\n",
"Requirement already satisfied: prometheus-client>=0.18.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.20.0)\n",
"Requirement already satisfied: prometheus-fastapi-instrumentator>=7.0.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (7.0.0)\n",
"Requirement already satisfied: tiktoken==0.6.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.6.0)\n",
"Requirement already satisfied: lm-format-enforcer==0.9.8 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.9.8)\n",
"Requirement already satisfied: outlines==0.0.34 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.0.34)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from vllm) (4.11.0)\n",
"Requirement already satisfied: filelock>=3.10.4 in /usr/local/lib/python3.10/dist-packages (from vllm) (3.14.0)\n",
"Requirement already satisfied: ray>=2.9 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.23.0)\n",
"Requirement already satisfied: nvidia-ml-py in /usr/local/lib/python3.10/dist-packages (from vllm) (12.550.52)\n",
"Requirement already satisfied: vllm-nccl-cu12<2.19,>=2.18 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.18.1.0.4.0)\n",
"Requirement already satisfied: torch==2.3.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.3.0+cu121)\n",
"Requirement already satisfied: xformers==0.0.26.post1 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.0.26.post1)\n",
"Requirement already satisfied: interegular>=0.3.2 in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (0.3.3)\n",
"Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (24.0)\n",
"Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (6.0.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (3.1.4)\n",
"Requirement already satisfied: lark in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.1.9)\n",
"Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.6.0)\n",
"Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (2.2.1)\n",
"Requirement already satisfied: diskcache in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (5.6.3)\n",
"Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.11.4)\n",
"Requirement already satisfied: numba in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (0.58.1)\n",
"Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.4.2)\n",
"Requirement already satisfied: referencing in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (0.35.1)\n",
"Requirement already satisfied: jsonschema in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (4.19.2)\n",
"Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken==0.6.0->vllm) (2023.12.25)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (1.12)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (3.3)\n",
"Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2023.6.0)\n",
"Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)\n",
"Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)\n",
"Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)\n",
"Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (8.9.2.26)\n",
"Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.3.1)\n",
"Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (11.0.2.54)\n",
"Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (10.3.2.106)\n",
"Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (11.4.5.107)\n",
"Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.0.106)\n",
"Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2.20.5)\n",
"Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)\n",
"Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2.3.0)\n",
"Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch==2.3.0->vllm) (12.5.40)\n",
"Requirement already satisfied: starlette<1.0.0,>=0.30.0 in /usr/local/lib/python3.10/dist-packages (from prometheus-fastapi-instrumentator>=7.0.0->vllm) (0.37.2)\n",
"Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->vllm) (0.7.0)\n",
"Requirement already satisfied: pydantic-core==2.18.2 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->vllm) (2.18.2)\n",
"Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (8.1.7)\n",
"Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.0.8)\n",
"Requirement already satisfied: protobuf!=3.19.5,>=3.15.3 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (3.20.3)\n",
"Requirement already satisfied: aiosignal in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.3.1)\n",
"Requirement already satisfied: frozenlist in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.4.1)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (3.3.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (3.7)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (2024.2.2)\n",
"Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from tokenizers>=0.19.1->vllm) (0.23.1)\n",
"Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.40.0->vllm) (0.4.3)\n",
"Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.40.0->vllm) (4.66.4)\n",
"Requirement already satisfied: fastapi-cli>=0.0.2 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.0.4)\n",
"Requirement already satisfied: httpx>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.27.0)\n",
"Requirement already satisfied: python-multipart>=0.0.7 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.0.9)\n",
"Requirement already satisfied: ujson!=4.0.2,!=4.1.0,!=4.2.0,!=4.3.0,!=5.0.0,!=5.1.0,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (5.10.0)\n",
"Requirement already satisfied: orjson>=3.2.1 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (3.10.3)\n",
"Requirement already satisfied: email_validator>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (2.1.1)\n",
"Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.14.0)\n",
"Requirement already satisfied: httptools>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.6.1)\n",
"Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (1.0.1)\n",
"Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.19.0)\n",
"Requirement already satisfied: watchfiles>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.21.0)\n",
"Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (12.0)\n",
"Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai->vllm) (3.7.1)\n",
"Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai->vllm) (1.7.0)\n",
"Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai->vllm) (1.3.1)\n",
"Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai->vllm) (1.2.1)\n",
"Requirement already satisfied: dnspython>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from email_validator>=2.0.0->fastapi->vllm) (2.6.1)\n",
"Requirement already satisfied: typer>=0.12.3 in /usr/local/lib/python3.10/dist-packages (from fastapi-cli>=0.0.2->fastapi->vllm) (0.12.3)\n",
"Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.23.0->fastapi->vllm) (1.0.5)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->outlines==0.0.34->vllm) (2.1.5)\n",
"Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (23.2.0)\n",
"Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (2023.12.1)\n",
"Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (0.18.1)\n",
"Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba->outlines==0.0.34->vllm) (0.41.1)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch==2.3.0->vllm) (1.3.0)\n",
"Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (1.5.4)\n",
"Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.10/dist-packages (from typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (13.7.1)\n",
"Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (3.0.0)\n",
"Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (2.16.1)\n",
"Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (0.1.2)\n"
]
}
],
"source": [
"!pip install vllm"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wn2lC5hVPUxy"
},
"source": [
"## Inference\n",
"\n",
"### Import vLLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q8o4A6QKPp3d"
},
"outputs": [],
"source": [
"from vllm import LLM"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "76mRtotdPu9N"
},
"source": [
"Instantiate vLLM's engine (note that Colab's free T4 GPU only supports `float32`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "k8y8SD1XPzAr"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
" warnings.warn(\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a45c956964f441619463ea8cb6ecad5a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"config.json: 0%| | 0.00/627 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO 05-24 06:23:04 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='google/gemma-2b', speculative_config=None, tokenizer='google/gemma-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float32, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=google/gemma-2b)\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "128bba1e8f4b4e779f4c77bae0f969b2",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/33.6k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "92c14362d70b4dc78c443f7397fa1d71",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer.model: 0%| | 0.00/4.24M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0605043aa7a445d889338f8aab5d8b6f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer.json: 0%| | 0.00/17.5M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "99bb3e9715ea463b959d4c7b3a75fcc1",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"special_tokens_map.json: 0%| | 0.00/636 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b03b15b10a0049398d9566685e84fa91",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"generation_config.json: 0%| | 0.00/137 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO 05-24 06:23:06 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1\n",
"INFO 05-24 06:23:08 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.\n",
"INFO 05-24 06:23:08 selector.py:32] Using XFormers backend.\n",
"WARNING 05-24 06:23:10 gemma.py:54] Gemma's activation function was incorrectly set to exact GeLU in the config JSON file when it was initially released. Changing the activation function to approximate GeLU (`gelu_pytorch_tanh`). If you want to use the legacy `gelu`, edit the config JSON to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.\n",
"INFO 05-24 06:23:10 weight_utils.py:199] Using model weights format ['*.safetensors']\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "163763c99f904eaf80620a6837169a56",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"model-00001-of-00002.safetensors: 0%| | 0.00/4.95G [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "aefe997ca83949628c822f46320ff4ed",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"model-00002-of-00002.safetensors: 0%| | 0.00/67.1M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"INFO 05-24 06:24:08 model_runner.py:175] Loading model weights took 9.3440 GB\n",
"INFO 05-24 06:24:16 gpu_executor.py:114] # GPU blocks: 1724, # CPU blocks: 7281\n",
"INFO 05-24 06:24:19 model_runner.py:937] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n",
"INFO 05-24 06:24:19 model_runner.py:941] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n",
"INFO 05-24 06:24:29 model_runner.py:1017] Graph capturing finished in 11 secs.\n"
]
}
],
"source": [
"llm = LLM(model=\"google/gemma-2b\", dtype=\"float32\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MMg2unvIOtH4"
},
"source": [
"Run inference with a single prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_JihVtFsOwsn"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 1.42it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Prompt: The capital of USA is, Gemma output: planning to invest in sustainable transportation and is ready to accept transformational advances in technology\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"prompt = \"The capital of USA is\"\n",
"output = llm.generate(prompt)\n",
"print()\n",
"print(f\"Prompt: {prompt}, Gemma output: {output[0].outputs[0].text}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FlLNziE93xdP"
},
"source": [
"Run batch inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AW-ex-Fgs_La"
},
"outputs": [],
"source": [
"prompts = [\n",
" \"The best thing about California is\",\n",
" \"Physics studies\",\n",
" \"The best place for sightseeing in Japan is\",\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "R8MPTh5XtH0Z"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Processed prompts: 100%|██████████| 3/3 [00:00<00:00, 3.59it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Prompt: The best thing about California is, Gemma output: that there are always new ways to spend time and enjoy the magnificent state as a\n",
"\n",
"Prompt: Physics studies, Gemma output: matter, its properties, how it interacts with other matter, and other fields of\n",
"\n",
"Prompt: The best place for sightseeing in Japan is, Gemma output: Kyoto. It is the very first city you will visit in Japan when you get\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"outputs = llm.generate(prompts)\n",
"\n",
"for output in outputs:\n",
" prompt = output.prompt\n",
" generated_text = output.outputs[0].text\n",
" print()\n",
" print(f\"Prompt: {prompt}, Gemma output: {generated_text}\")"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "[Gemma_2]Deploy_with_vLLM.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}